NHGH analysis

Katrine Meldgård, Margrethe Bøe Lysø, Kristine Rosted Petersen, Enrico Leonardi and Pernille Jensen

Introduction

NHANES glycohemoglobin data

  • National Health and Nutrition Examination Survey

Diabetes Mellitus (DM)

  • Type 1 Diabetes: Insufficient production of insulin.

  • Type 2 Diabetes: Ineffective use of insulin.

  • About 422 million people affected worldwide, 1.5 million deaths each year

Aim

  • Correlation between biomarkers/measurements and diabetes
  • Whether measurements return to normal levels after medication
  • How income class relates to diabetes diagnosis and medication

Methods

  • Raw data: 6795 observations with 20 variables.
  • 01_load_data: Data are loaded and split into metadata and measurements.
  • 02_clean_data (data wrangling): age values rounded down to whole years, column names replaced with more meaningful names, unusable values set to NA, and the two data sets re-joined into one.
  • 03_augment: sex recoded as 0/1, units converted, and derived columns added (body fat percentage, conicity index and ‘dm_status’).
# 01_load_data.qmd
library("tidyverse")
raw_data <- read_tsv(file = "https://hbiostat.org/data/repo/nhgh.tsv")
raw_data
# A tibble: 6,795 × 20
    seqn sex      age re        income    tx    dx    wt    ht   bmi   leg  arml
   <dbl> <chr>  <dbl> <chr>     <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 51624 male    34.2 Non-Hisp… [2500…     0     0  87.4  165.  32.2  41.5  40  
 2 51626 male    16.8 Non-Hisp… [4500…     0     0  72.3  181.  22    42    39.5
 3 51628 female  60.2 Non-Hisp… [1000…     1     1 117.   166   42.4  35.3  39  
 4 51629 male    26.1 Mexican … [2500…     0     0  97.6  173   32.6  41.7  38.7
 5 51630 female  49.7 Non-Hisp… [3500…     0     0  86.7  168.  30.6  37.5  36.1
 6 51633 male    80   Non-Hisp… [1500…     0     1  79.1  174.  26.0  42.8  40  
 7 51635 male    80   Non-Hisp… [1500…     1     1  89.6  180.  27.6  43    41.7
 8 51640 male    17.4 Other Hi… [1000…     0     0  74.7  170.  26.0  39.8  38.1
 9 51641 male    13   Non-Hisp… [7500…     0     0  40.6  156.  16.6  39.2  33.4
10 51643 female  43   Non-Hisp… [3500…     1     1 108.   164.  39.9  32.7  36.5
# ℹ 6,785 more rows
# ℹ 8 more variables: armc <dbl>, waist <dbl>, tri <dbl>, sub <dbl>, gh <dbl>,
#   albumin <dbl>, bun <dbl>, SCr <dbl>
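
The split into metadata and measurements, and the later re-join, are not shown in the chunks here. A minimal sketch of how they could look, assuming that `seqn`, `sex` and `age` count as metadata and that `seqn` is kept in both tables as the join key. The column assignment, the join type and the toy values are assumptions for illustration, not the project's actual code:

```r
library("dplyr")

# Hypothetical stand-in for raw_data (values are illustrative only)
toy_raw <- tibble(
  seqn = c(1, 2, 3),
  sex  = c("male", "male", "female"),
  age  = c(34.2, 16.8, 60.2),
  wt   = c(87.4, 72.3, 117.0),
  gh   = c(5.2, 5.7, 6.0)
)

# Assumed metadata columns; the key seqn is kept in both tables
meta_cols <- c("seqn", "sex", "age")

meta_data <- toy_raw |>
  select(all_of(meta_cols))

measurements <- toy_raw |>
  select(seqn, !all_of(meta_cols))

# Re-join on the shared key once both tables have been cleaned
rejoined <- meta_data |>
  full_join(measurements, by = "seqn")
```

Keeping the key in both tables is what makes the later re-join possible without relying on row order.
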
# 02_clean_data.qmd
# Round age down to
# full years
meta_data <- meta_data |>
  mutate(age = floor(age))
# Set the open-ended income
# categories to NA
meta_data <- meta_data |>
  mutate(income = case_when(
    income %in% c("< 20000", "> 20000") ~ NA_character_,
    .default = income
  ))
# 02_clean_data.qmd
# Rename columns in the measurement data
measurements <- measurements |>
  rename("id" = "seqn",
         "medication" = "tx",
         "diagnosis" = "dx",
         "weight" = "wt",
         "height" = "ht",
         "tri_skinfold" = "tri",
         "sub_skinfold" = "sub",
         "creatinine" = "SCr")
# 03_augment.qmd
# Add new variables: binary sex, body fat percentage (bfp),
# conicity index (ci) and diabetes status (dm_status)
diabetes_data <- diabetes_data |>
  mutate(bin_sex = case_when(
                    sex == "male" ~ 1,
                    sex == "female" ~ 0
                  ),
         bfp = 1.39 * bmi + 0.16 * age - 10.34 * bin_sex - 9,
         # Waist and height converted from cm to m
         waist_m = waist / 100,
         height_m = height / 100,
         ci = waist_m / (0.109 * sqrt(weight / height_m)),
         # 1 = not diagnosed, 2 = diagnosed and unmedicated,
         # 3 = diagnosed and medicated
         dm_status = case_when(
                      diagnosis == 0 ~ 1,
                      diagnosis == 1 & medication == 0 ~ 2,
                      diagnosis == 1 & medication == 1 ~ 3
         ))
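
To make the two derived measures concrete, the formulas from the augment step can be evaluated by hand for a single person. The sketch below uses a hypothetical individual; the waist value in particular is made up for illustration and is not taken from the data:

```r
# Hypothetical individual: male, 34 years, BMI 32.2, weight 87.4 kg,
# height 165 cm, waist 100 cm (waist is an invented value)
bin_sex <- 1
age     <- 34
bmi     <- 32.2
weight  <- 87.4
height  <- 165
waist   <- 100

# Body fat percentage, same formula as in the augment step
bfp <- 1.39 * bmi + 0.16 * age - 10.34 * bin_sex - 9

# Conicity index: waist in m over 0.109 * sqrt(weight in kg / height in m)
ci <- (waist / 100) / (0.109 * sqrt(weight / (height / 100)))

round(c(bfp = bfp, ci = ci), 2)
```
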

Descriptive analysis

Observations: 6795
Variables (augmented): 26
Diagnosed: 914
Medicated: 607
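
Counts like these can be obtained with a single `summarise()` call. A small sketch on a hypothetical five-row table that uses the renamed `diagnosis` and `medication` columns; the toy values are invented and unrelated to the figures above:

```r
library("dplyr")

# Hypothetical miniature of diabetes_data
toy <- tibble(
  diagnosis  = c(0, 0, 1, 1, 1),
  medication = c(0, 0, 0, 1, 1)
)

# Count rows, diagnosed individuals and medicated individuals
counts <- toy |>
  summarise(
    observations = n(),
    diagnosed    = sum(diagnosis == 1),
    medicated    = sum(medication == 1)
  )
counts
```
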

Data Visualization: Diabetes vs Income & Age

Visualizing the data helps clarify how diabetes diagnosis relates to income and age.

Biomarkers and diagnosis/medication status

Physical attributes and disease/medication status

PCA Analysis

  • Data
    • Non-medicated individuals
    • No observations with NA
    • Only anthropometric and biomarker measurements
  • Classes are not separated
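
The PCA setup described above can be sketched with base R's `prcomp()`. The data below are simulated stand-ins, not NHANES values, and the choice of three columns and of centring and scaling is an assumption about how the analysis was run:

```r
set.seed(1)

# Simulated stand-in for the non-medicated subset (not NHANES data)
n <- 100
toy <- data.frame(
  weight    = rnorm(n, 80, 15),
  waist     = rnorm(n, 95, 12),
  gh        = rnorm(n, 5.6, 0.8),
  diagnosis = rbinom(n, 1, 0.1)
)
toy$gh[c(3, 17)] <- NA  # a few missing values to drop

# Keep complete cases and only the numeric measurement columns
complete <- toy[complete.cases(toy), ]
pca <- prcomp(complete[, c("weight", "waist", "gh")],
              center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca)$importance["Proportion of Variance", ]

# Scores on the first two components, to be coloured by diagnosis in a plot
scores <- data.frame(pca$x[, 1:2], diagnosis = complete$diagnosis)
head(scores)
```

If the two diagnosis groups overlap in a scatter plot of PC1 against PC2, the classes are not separated by the leading components.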

Logistic regression model

  • Variables retained by backward selection:
    • Weight
    • Leg
    • Waist
    • Creatinine
    • Glycohemoglobin
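
Backward selection for a logistic regression can be sketched with `glm()` and `step()`. The example uses simulated data in which only `gh` drives the outcome. The AIC criterion used by `step()` is an assumption; the selection criterion actually used in the project is not stated here:

```r
set.seed(2)

# Simulated data: only gh truly influences the outcome here
n <- 500
toy <- data.frame(
  weight = rnorm(n, 80, 15),
  leg    = rnorm(n, 40, 4),
  gh     = rnorm(n, 5.6, 0.8)
)
toy$diagnosis <- rbinom(n, 1, plogis(-12 + 1.8 * toy$gh))

# Full logistic regression model, then AIC-based backward elimination
full_model <- glm(diagnosis ~ weight + leg + gh,
                  data = toy, family = binomial)
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Variables that survive the elimination
formula(reduced_model)
```
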

Classification

  • Model based on the variables selected by logistic regression (LR)
  • Data
    • Non-medicated individuals
    • No observations with NA
  • Nearly all observations predicted as 0
  • AUC = 0.7750374
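
A result of this kind, with almost everything predicted as 0 yet a reasonable AUC, can be reproduced on simulated imbalanced data. The sketch below is not the project's model: the data are simulated, the 0.5 threshold is an assumption, and the AUC is computed with the rank-based (Mann-Whitney) formula in base R rather than with a dedicated package:

```r
set.seed(3)

# Simulated imbalanced data (roughly 10% positives); not NHANES data
n <- 1000
gh <- rnorm(n, 5.6, 0.8)
diagnosis <- rbinom(n, 1, plogis(-9 + 1.2 * gh))

model <- glm(diagnosis ~ gh, family = binomial)
prob <- predict(model, type = "response")

# Confusion matrix at an assumed 0.5 threshold
predicted <- factor(as.integer(prob > 0.5), levels = c(0, 1))
confusion <- table(predicted = predicted, actual = diagnosis)
confusion

# AUC: probability that a random positive is ranked above a random negative
r <- rank(prob)
n_pos <- sum(diagnosis == 1)
n_neg <- sum(diagnosis == 0)
auc <- (sum(r[diagnosis == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

With few positives, fitted probabilities rarely exceed 0.5, so nearly every observation is classified as 0 even when the ranking of individuals, which is what the AUC measures, is reasonable.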

Discussion

  • Relation between anthropometric and biomarker measurements.
  • Classification and PCA: No clear relation.
  • Confusion matrix: Few diabetic cases predicted.
  • Uneven distribution between diabetic and non-diabetic individuals.
    • Improvement: A larger group of diabetic individuals.
  • Income classes and medication.

Conclusion: No clear relation to diabetes was found.